In [2]:
import datetime
from datetime import date
import pandas as pd
import numpy as np
from plotly import __version__

import matplotlib.pyplot as plt

import plotly.offline as pyo
import plotly.graph_objs as go
from plotly.offline import iplot

This notebook which accompanies the lecture on the Bias Variance Tradeoff and Regularization at UC Berkeley for DS100 created by Joseph E. Gonzalez

Adapted for CS475 (at JHU) by Carlos Aguirre

Adapted for CS349 (at NU) by Zach Wood-Doughty

Bias vs Variance + Regularization¶

In this notebook we will adjust the number of polynomial features to control model complexity and tradeoff bias and variance.

Toy Dataset¶

To visualize the Bias vs Variance tradeoff and regularization we will use an easy to visualize syntethic dataset.

In [3]:
np.random.seed(42)
n = 50
sigma = 10

X = np.linspace(-10, 10, n)
X = np.sort(X)
Y = 2. * X + 10. + sigma * np.random.randn(n) +  20*np.sin(X) + 0.8*(X)**2
X = X/5
data_points = go.Scatter(name="data", x=X, y=Y, mode='markers')
iplot(go.Figure(data=data_points))

Train Test Splits

Scikit-learn has already built in functions to separate our dataset intro train and test sets

In [4]:
## Train Test Split
from sklearn.model_selection import train_test_split 
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=42)
train_points = go.Scatter(name="Train Data", 
                          x=X_tr, y=Y_tr, mode='markers',  marker=dict(color="blue", symbol=0))
test_points = go.Scatter(name="Test Data",
                         x=X_te, y=Y_te, mode='markers', marker=dict(color="red", symbol="x"))
iplot([train_points, test_points], filename="toydataset-reg-lecture")

Polynomial Features: Bias vs Variance¶

In the following we work through a basic bias variance tradeoff for different numbers of polynomial features. To make things a bit more interesting we will also add an additional sine feature:

$$ \large \Phi_d(x) = \left[\sin(\omega x), x, x^2, \ldots, x^d \right] $$

with the appropriate $\omega = 5$ value determined from prior knowledge (because we created the data).

We use the following Python function to generate a $\Phi$:

In [5]:
def poly_phi(k):
    return lambda X: np.array([np.sin(X*5)] + [X ** i for i in range(1, k+1)]).T

We will consider the following basis sizes:

In [6]:
basis_sizes = [1, 2, 5, 8, 16, 24, 32]

Create a $\Phi$ feature matrix for each number of basis:

In [7]:
Phis_tr = { k: poly_phi(k)(X_tr) for k in basis_sizes }

Train a model for each number of feature

In [8]:
from sklearn import linear_model
models = {}
for k in Phis_tr:
    model = linear_model.LinearRegression()
    model.fit(Phis_tr[k], Y_tr)
    models[k] = model

Visualize the various models¶

The following code is a bit complicated but essentially makes an interactive visualization of the various models.

In [9]:
# Make the x values where plot points will be generated
X_plt = np.linspace(np.min(X)-1, np.max(X)+1, 200)

# Generate the Plotly line objects by predicting the value at each X_plt
lines = []
for k in sorted(models.keys()):
    ytmp = models[k].predict(poly_phi(k)(X_plt))
    # Plotting software fails with large numbers
    ytmp[ytmp > 500] = 500
    ytmp[ytmp < -500] = -500
    lines.append(
        go.Scatter(name="degree "+ str(k), x=X_plt,
                   y = ytmp,visible=False))

# Construct steps for the interactive slider
lines[0].visible=True
steps = []
for i in range(len(lines)):
    step = dict(
        label = lines[i]['name'],
        method = 'restyle',
        args = ['visible', [False] * (len(lines)+1)],
    )
    step['args'][1][0] = True
    step['args'][1][i+1] = True # Toggle i'th trace to "visible"
    steps.append(step)

# Build the slider object
sliders = [dict(active = 0, pad = {"t": 20}, steps = steps)]

# render the plot
layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), 
                   yaxis=dict(range=[np.min(Y) -5 , np.max(Y)+5]),
                   sliders=sliders)
iplot(go.Figure(data = [train_points] + lines, layout=layout), filename="poly_curve_comparison")

Cross Validation: What degree is better?¶

With the remaining training data:

  1. Split the remaining training dataset k-ways as in the Figure above. The figure uses 5-Fold which is standard. You should split the data in the same way you constructed the test set (this is typically randomly)
  2. For each split train the model on the training fraction and then compute the error (RMSE) on the validation fraction.
  3. Average the error across each validation fraction to estimate the test error.

Define the error Function

We will continue to use the mean squared error but this time rather than define it ourselves we will import it from scikit-learn.

In [10]:
from sklearn.metrics import mean_squared_error

In the following we compute the cross validated RMSE for different numbers of polynomial basis values

In [11]:
from sklearn.model_selection import KFold
kfold_splits = 5
kfold = KFold(kfold_splits, shuffle=True, random_state=42)

mse_scores = []
poly_degrees = np.arange(1,16)
for k in poly_degrees:
    Phi = poly_phi(k)(X_tr)
    
    # One step in k-fold cross validation
    def score_model(train_index, test_index):
        model = linear_model.LinearRegression()
        model.fit(Phi[train_index,], Y_tr[train_index])
        return mean_squared_error(Y_tr[test_index], 
                                  model.predict(Phi[test_index,]))
    

    score = np.mean([score_model(tr_ind, te_ind) 
                     for (tr_ind, te_ind) in kfold.split(Phi)])
    mse_scores.append(score)
    
    
rmse_scores = np.sqrt(np.array(mse_scores))

iplot(
    go.Figure(
        data=[go.Scatter(name="CV Curve", x=poly_degrees, y=rmse_scores),
              go.Scatter(name="Optimum", x=[poly_degrees[np.argmin(rmse_scores)]], y=[np.min(rmse_scores)],
                         mode="markers", marker=dict(color="red", size=10))
             ],
        layout=go.Layout(xaxis=dict(title="Degree"), yaxis=dict(title="CV RMSE"))),
    filename="basis_function_cv_curve")

Questions?¶

  1. Is 2 a reasonable answer given the data? (Look at the earlier plot.)
  2. Does this depend on the order in which we consider basis?

Variance¶

  1. What happens if we repeat the training procedure with different versions of our training data?

  2. How will the models change as we increase the degree of our polynomial?

Here we use cross validation to simulate multiple train and test splits by splitting the training data into train and validation datasets. We then visualize each of the corresponding models.

In [12]:
from sklearn.model_selection import KFold

# Construct a KFold object which will define random index splits
kfold_splits = 5
kfold = KFold(kfold_splits, shuffle=True, random_state=42)

# Construct the lines
lines = []

# Make the test lines
X_plt = np.linspace(np.min(X)-1, np.max(X)+1, 200)

# for each train validation split (we ignore the validation data)
for k in basis_sizes:
    for (tr_ind, val_ind) in kfold.split(X_tr):
        # Construct the Phi matrices for each basis size
        Phi = poly_phi(k)(X_tr[tr_ind]) 
        # Fit the model
        model = linear_model.LinearRegression()
        model.fit(Phi, Y_tr[tr_ind])
        # Make predictions at the plotting points
        ytmp = model.predict(poly_phi(k)(X_plt))
        # Plotly has a bug and fails with large numbers
        ytmp[ytmp > 500] = 500
        ytmp[ytmp < -500] = -500
        line = go.Scatter(name="degree "+ str(k), visible=False,
                          x=X_plt, y=ytmp)
        lines.append(line)

for l in lines[0:kfold_splits]:
    l.visible=True

steps = []
for i in range(len(basis_sizes)):
    step = dict(
        label = "Degree " + str(basis_sizes[i]) ,
        method = 'restyle',
        args = ['visible', [False] * (len(lines)+1)]
    )
    step['args'][1][-1] = True
    for u in range(i*kfold_splits, (i+1)*kfold_splits):
        step['args'][1][u] = True # Toggle i'th trace to "visible"
    steps.append(step)
    
sliders = [dict(
    active = 0,
    pad = {"t": 20},
    steps = steps
)]

layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), 
                   yaxis=dict(range=[np.min(Y) -5 , np.max(Y) + 5]),
                   sliders=sliders,
                   showlegend=False)

iplot(go.Figure(data = lines + [train_points], layout=layout), filename="cv_bv_slider")
    

Introduction: Regularization¶

Last time we adjusted the number of polynomial features to control model complexity and tradeoff bias and variance. However, this approach to managing model complexity has a few critical limitations:

  1. complexity varies discretely
  2. we may only need a few of the higher degree polynomial terms
  3. In general we may not have a natural way to order our basis

Rather than changing the dimension we can instead apply regularization to the weights. More generally, we can adopt the framework of regularized loss minimization.

$$ \large \hat{\theta} = \arg \min_\theta \frac{1}{n} \sum_{i=1}^n \textbf{Loss}\left(y_i, f_\theta(x_i)\right) + \lambda \textbf{R}(\theta) $$

The regularization term $\textbf{R}(\theta)$ penalizes for $\theta$ values that result in more complex and therefore higher variance models. The regularization parameter $\lambda$ determines the degree of regularization to apply and is typically determined through cross validation.

$L2$ or `Ridge' Regression¶

There are many forms for $\textbf{R}(\theta)$ but a common form is the squared $L^2$ norm of $\theta$.

$$\large \large \textbf{R}_{L^2}(\theta) = \large||\theta||_2^2 = \theta^T \theta = \sum_{k=1}^p \theta_k^2 $$

In the context of least squares regression this is often referred to as Ridge Regression with the objective:

$$ \large \hat{\theta} = \arg \min_\theta \frac{1}{n} \sum_{i=1}^n \left(y_i - f_\theta(x_i)\right)^2 + \lambda ||\theta||_2^2 $$

This is also sometimes called Tikhonov Regularization.

How does $L^2$ Regularization Help¶

The $L^2$ penalty helps in several ways:

Manages Model Complexity

  1. It ensures that uninformative features weights are relatively small (near zero) mitigating the affect of those features.
  2. It evenly distributes weight over similar features to reduce variance.

Practical Concerns

  1. It removes degeneracy created by co-linear features
  2. It improves the numerical stability of the model (problem of small samples)

Visualizing $L^2$ Regularization¶

In the following we visualize the regularization surface. Notice that it pushes weights towards zero but is relatively smooth around the origin.

In [13]:
theta_range = np.linspace(-2,2,100) 
(u,v) = np.meshgrid(theta_range, theta_range)
w_values = np.vstack((u.flatten(), v.flatten())).T

def l2_sq_reg(w):
    return np.sum(w**2)
reg_values = [l2_sq_reg(w) for w in w_values]
reg_surface = go.Surface(
    x = u, y = v,
    z = np.reshape(reg_values, u.shape),
    contours=dict(z=dict(show=True))
)

# Axis labels
layout = go.Layout(
    scene=go.Scene(
        xaxis=go.XAxis(title='w0'),
        yaxis=go.YAxis(title='w1'),
        aspectratio=dict(x=2.,y=2., z=1.), 
        camera=dict(eye=dict(x=-2, y=-2, z=2))
    )
)
fig = go.Figure(data = [reg_surface], layout = layout)
iplot(fig, filename="L2regularization")
/Users/zachwooddoughty/opt/miniconda3/envs/cs349setup/lib/python3.9/site-packages/plotly/graph_objs/_deprecations.py:544: DeprecationWarning:

plotly.graph_objs.XAxis is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.layout.XAxis
  - plotly.graph_objs.layout.scene.XAxis


/Users/zachwooddoughty/opt/miniconda3/envs/cs349setup/lib/python3.9/site-packages/plotly/graph_objs/_deprecations.py:572: DeprecationWarning:

plotly.graph_objs.YAxis is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.layout.YAxis
  - plotly.graph_objs.layout.scene.YAxis


/Users/zachwooddoughty/opt/miniconda3/envs/cs349setup/lib/python3.9/site-packages/plotly/graph_objs/_deprecations.py:489: DeprecationWarning:

plotly.graph_objs.Scene is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.layout.Scene


Applying $L^2$ Regularization using Scikit Learn¶

In the following we use the linear_model.Ridge model in scikit learn. To demonstrate the efficacy of regularization we will use the degree 32 polynomials which substantially overfit the data.

In [14]:
Phi = poly_phi(32)(X_tr)

Normalization and the Intercept¶

Before we proceed it is important that we appropriately normalize the data. Because the standard $L^2$ regularization methods treat each dimensional equivalently it is important that all dimensions are in the same range of values.

However, if we examine the polynomial features in $\Phi$ we notice that the distribution of values can be quite different for each dimension.

For example in the following we plot the degree 3 and degree 6 dimensions:

In [15]:
import plotly.figure_factory as ff
iplot(ff.create_distplot([Phi[:,3], Phi[:,6]], group_labels=['x^3', 'x^6']), filename="phi_dist_plot")

Notice:

  1. difference in spread of vakues between $x^3$ and $x^6$
  2. asymmetry- $x^6$ spreads out much more to the right

Standardizing the Data

A common transformation is to center and scale the features to zero mean and unit variance:

$$\large z = \frac{x - \mu}{\sigma} $$

This an be accomplished by applying the StandardScalar scikit learn preprocessor.

In [16]:
from sklearn.preprocessing import StandardScaler

normalizer = StandardScaler()
normalizer.fit(poly_phi(32)(X_tr))

# we define the phi function for reuse in the future
def phi_fun(X):
    return normalizer.transform(poly_phi(32)(X))

Phi = phi_fun(X_tr)

Notice in the above code snippet we define a new $\Phi$ function that applies the pre-trained normalization. This procedure learns something about the training data in the formulation of the normalizer.

  1. Will this be an issue when we cross validate on the training data?

This process of transformations: feature construction, rescaling, and then subsequently model fitting form pipelines. Scikit learn actually has a pipeline framework to aid with this process.

In the following we plot the spread of the transformed dimensions. They are still not the same but are at least on the same scale.

In [17]:
iplot(ff.create_distplot([Phi[:,3], Phi[:,6]], group_labels=['x^4', 'x^7'], bin_size=0.3), filename="phi_dist_plot2")

Fitting the Ridge Regression Model¶

We are now finally ready to fit the ridge regression model. However, we haven't yet decided on a value for the regularization parameter $\lambda$. Therefore, we will try a range of values.

In [18]:
import sklearn.linear_model as linear_model
lam_values = np.hstack((np.logspace(-8,-1,10), 
                        np.logspace(-1,2,10),
                        np.logspace(2,10,10)))

models = [
    linear_model.Ridge(alpha = lam).fit(Phi, Y_tr)
    for lam in lam_values
]

Move the slider in the following plot to see the fit for various $\lambda$ values.

In [19]:
# Make the x values where plot points will be generated
X_plt = np.linspace(np.min(X)-1, np.max(X)+1, 200)

# Generate the Plotly line objects by predicting the value at each X_plt
lines = []
for k in range(len(models)):
    ytmp = models[k].predict(phi_fun(X_plt))
    # Plotting software fails with large numbers
    ytmp[ytmp > 500] = 500
    ytmp[ytmp < -500] = -500
    lines.append(
        go.Scatter(name="Lambda "+ str(lam_values[k]), 
                   x=X_plt, y = ytmp, visible=False))

# Construct steps for the interactive slider
lines[0].visible=True
steps = []
for i in range(len(lines)):
    step = dict(
        label = lines[i]['name'],
        method = 'restyle',
        args = ['visible', [False] * (len(lines)+1)],
    )
    step['args'][1][0] = True
    step['args'][1][i+1] = True # Toggle i'th trace to "visible"
    steps.append(step)

# Build the slider object
sliders = [dict(active = 0, pad = {"t": 20}, steps = steps)]

# render the plot
layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), 
                   yaxis=dict(range=[np.min(Y) -5 , np.max(Y) + 5]),
                   sliders=sliders,
                   showlegend=False)
iplot(go.Figure(data = [train_points] + lines, layout=layout), filename="ridge_regression_lines")

For large $\lambda$ values we see a smoother fit. Notice however that the model does not appear to perform well outside of the input data range. This is a common problem with polynomial feature transformations.

Using Cross Validation in the RidgeCV Model¶

Because cross validation is essential to determining the optimal regularization parameter there is built-in support for cross validation in linear_model.RidgeCV. Here we call the built-in cross validation routine passing the lambda values we wish to consider.

In [20]:
ridge_cv_model = linear_model.RidgeCV(alphas=lam_values, 
                                      store_cv_values=True)
# Fit the model to our training data
ridge_cv_model.fit(Phi, Y_tr)

# Plot the predicted model
ridge_cv_line = go.Scatter(name = "Ridge CV Curve",
                           x = X_plt,
                           y = ridge_cv_model.predict(phi_fun(X_plt)))
# render the plot
layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), 
                   yaxis=dict(range=[np.min(Y) -5 , np.max(Y)+5]))
iplot(go.Figure(data = [train_points, ridge_cv_line], layout=layout), filename="ridge_cv_line")

We can select the best performing $\lambda$ and plot the performance of different $\lambda$ values

In [21]:
ridge_cv_loss = np.sqrt(np.mean(ridge_cv_model.cv_values_,axis=0))
iplot(
    go.Figure(
        data=[go.Scatter(name="CV Curve", x=lam_values, y=ridge_cv_loss),
              go.Scatter(name="Optimum", x=[lam_values[np.argmin(ridge_cv_loss)]], y=[np.min(ridge_cv_loss)],
                         mode="markers", marker=dict(color="red", size=10))
             ],
        layout=go.Layout(xaxis=dict(title="Lambda", type="log"), 
                         yaxis=dict(title="CV RMSE", range=[10,70]))),
    filename="ridge_cv_model_curve")

$L^1$ Regularization (Lasso)¶

Another common regularization function is the sum of the absolute values:

$$\large \large \textbf{R}_{L^1}(\theta) = \sum_{k=1}^p |\theta_k| $$

This is called $L^1$ regularization as it corresponds to the $L^1$ norm. Least squares linear regression in conjunction with the $L^1$ norm is often called the Lasso (Least Absolute Shrinkage and Selection Operator).

In contrast to the $L^2$ norm the $L^1$ norm encourages $\theta_i$ values to be exactly zero in less informative dimensions thereby reducing model complexity.

In [22]:
theta_range = np.linspace(-2,2,100) 
(u,v) = np.meshgrid(theta_range, theta_range)
w_values = np.vstack((u.flatten(), v.flatten())).T

def l1_reg(w):
    return np.sum(np.abs(w))
reg_values = [l1_reg(w) for w in w_values]
reg_surface = go.Surface(
    x = u, y = v,
    z = np.reshape(reg_values, u.shape),
    contours=dict(z=dict(show=True))
)

# Axis labels
layout = go.Layout(
    scene=go.Scene(
        xaxis=go.XAxis(title='w0'),
        yaxis=go.YAxis(title='w1'),
        aspectratio=dict(x=2.,y=2., z=1.), 
        camera=dict(eye=dict(x=-2, y=-2, z=2))
    )
)
fig = go.Figure(data = [reg_surface], layout = layout)
iplot(fig, filename="L1regularization")
/Users/zachwooddoughty/opt/miniconda3/envs/cs349setup/lib/python3.9/site-packages/plotly/graph_objs/_deprecations.py:544: DeprecationWarning:

plotly.graph_objs.XAxis is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.layout.XAxis
  - plotly.graph_objs.layout.scene.XAxis


/Users/zachwooddoughty/opt/miniconda3/envs/cs349setup/lib/python3.9/site-packages/plotly/graph_objs/_deprecations.py:572: DeprecationWarning:

plotly.graph_objs.YAxis is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.layout.YAxis
  - plotly.graph_objs.layout.scene.YAxis


/Users/zachwooddoughty/opt/miniconda3/envs/cs349setup/lib/python3.9/site-packages/plotly/graph_objs/_deprecations.py:489: DeprecationWarning:

plotly.graph_objs.Scene is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.layout.Scene


Notice the shape of the curve is very different than L2. This is not smooth around the origin, and pushes values to 0 with a steeper gradient.

$L^1$ regularized regression in scikit-learn¶

In the following we use the scikit-learn Lasso package. As before we will try a range of values for the regularization parameter.

In [23]:
lam_values = np.logspace(-1.3,2.5,20)
models = []
for lam in lam_values:
    model = linear_model.Lasso(alpha = lam, max_iter=100000)
    model.fit(Phi, Y_tr)
    models.append(model)

Again we can plot the fit for different regularization penalties.

In [24]:
# Make the x values where plot points will be generated
X_plt = np.linspace(np.min(X)-1, np.max(X)+1, 200)

# Generate the Plotly line objects by predicting the value at each X_plt
lines = []
# Make the full polynomial
poly = np.array([r"\theta_0 \sin(x)"] + [r" \theta_{" + str(d) + "} x^{"+str(d)+"} " for d in range(1, 33)])


for k in range(len(models)):
    ytmp = models[k].predict(phi_fun(X_plt))
    # Plotting software fails with large numbers
    ytmp[ytmp > 500] = 500
    ytmp[ytmp < -500] = -500
    num_features = np.sum(~np.isclose(models[k].coef_, 0.))
    # get all the nonzer terms
    #     non_zero_terms = ~np.isclose(models[k].coef_,0)
    #     poly_str = "$" +("+".join(poly[non_zero_terms])) + "$"
    lines.append(
        go.Scatter(name=(
            "Lambda "+ str(lam_values[k]) + 
            " num features = " + str(num_features)
            + " out of " + str(len(models[k].coef_))), 
                   x=X_plt, y = ytmp, visible=False))

# Construct steps for the interactive slider
lines[0].visible=True
steps = []
for i in range(len(lines)):
    step = dict(
        label = lines[i]['name'],
        method = 'restyle',
        args = ['visible', [False] * (len(lines)+1)],
    )
    step['args'][1][0] = True
    step['args'][1][i+1] = True # Toggle i'th trace to "visible"
    steps.append(step)

# Build the slider object
sliders = [dict(active = 0, pad = {"t": 20}, steps = steps)]

# render the plot
layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), 
                   yaxis=dict(range=[np.min(Y) -5 , np.max(Y) + 5]),
                   sliders=sliders,
                   showlegend=False)
iplot(go.Figure(data = [train_points] + lines, layout=layout), filename="lasso_regression_lines")

Notice:¶

What happens in the above plot for larger values of $\lambda$?

Cross Validated Solution¶

As with Ridge regression, scikit-learn provides support for cross validation directly in the Lasso model training procedure. In the following we LassoCV to determine the best regularization parameter.

In [25]:
lasso_cv_model = linear_model.LassoCV(alphas=lam_values, max_iter=1000000)
# Fit the model to our training data
lasso_cv_model.fit(Phi, Y_tr)

# Plot the predicted model
lasso_cv_line = go.Scatter(name = "Lasso CV Curve",
                           x = X_plt,
                           y = lasso_cv_model.predict(phi_fun(X_plt)))
# render the plot
layout = go.Layout(xaxis=dict(range=[np.min(X_plt), np.max(X_plt)]), 
                   yaxis=dict(range=[np.min(Y) -5 , np.max(Y)+5]))
iplot(go.Figure(data = [train_points, lasso_cv_line], layout=layout), filename="lasso_cv_line")

Let's look at the polynomial terms with non-zero $\theta$

In [26]:
from IPython.display import display, Markdown 

# get all the nonzer terms
non_zero_terms = ~np.isclose(lasso_cv_model.coef_,0)

# Make the full polynomial
poly = np.array([r"\theta_0 \sin(x)"] + [r" \theta_{" + str(d) + "} x^{"+str(d)+"} " for d in range(1, 33)])

# Print only the nonzero terms
display(Markdown(r"$\large" + ("+".join(poly[non_zero_terms])) + "$"))

$\large\theta_0 \sin(x)+ \theta_{1} x^{1} + \theta_{2} x^{2} + \theta_{31} x^{31} $

Notice that only some of the $x$ values are included in the final regression model.

In [27]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


kfold_splits = 5
kfold = KFold(kfold_splits, shuffle=True, random_state=42)

mse_scores = []
for lam in lam_values:
    # One step in k-fold cross validation
    def score_model(train_index, test_index):
        model = linear_model.Lasso(alpha=lam, max_iter=1000000)
        model.fit(Phi[train_index,:], Y_tr[train_index])
        return mean_squared_error(Y_tr[test_index], model.predict(Phi[test_index,]))
    
    mse_score = np.mean([score_model(tr_ind, te_ind) 
                     for (tr_ind, te_ind) in kfold.split(Phi)])
    mse_scores.append(mse_score)
rmse_scores = np.sqrt(np.array(mse_scores))


iplot(
    go.Figure(
        data=[go.Scatter(name="CV Curve", x=lam_values, y=rmse_scores),
              go.Scatter(name="Optimum", x=[lam_values[np.argmin(rmse_scores)]], y=[np.min(rmse_scores)],
                         mode="markers", marker=dict(color="red", size=10))
             ],
        layout=go.Layout(xaxis=dict(title="Lambda",type="log",range=[-1.2,1.5]), 
                         yaxis=dict(title="CV RMSE", range=[5,30]))),
    filename="lasso_cv_model_curve")
In [ ]:
 

We can again see the effect of different lambda.

In [ ]: